# Regex and digitized books

## Get text data

Let us download one of the digitized books from The Royal Danish Library. The digitized books have been scanned and processed with OCR (Optical Character Recognition) and are available as PDF files.

The book is [Macdonald, James. Travels through Denmark and Part of Sweden during the Winter and Spring of the Year 1809 : Containing Authentic Particulars of the Domestic Condition of Those Countries, the Opinions of the Inhabitants, and the State of Agriculture. 2015.](https://soeg.kb.dk/permalink/45KBDK_KGL/1pioq0f/alma99122806627205763)

We download the book, extract the text content and store it in a variable called full_text.


In [2]:
#! pip install PyPDF2
import requests
from io import BytesIO
from PyPDF2 import PdfReader

# URL of the OCR-scanned PDF version of the text
url = "https://www.kb.dk/e-mat/dod/130014244515_bw.pdf"

# Download the pdf file
response = requests.get(url)
response.raise_for_status()  # Check if the request was successful

# Open the pdf file in memory
pdf_file = BytesIO(response.content)

# Create a PDF reader object
reader = PdfReader(pdf_file)

# Extract text from each page starting from page 5
text_content = []
for page in reader.pages[4:]:
    text_content.append(page.extract_text())

# Join all the text content into a single string
full_text = "\n".join(text_content)

# Print the first 1000 characters of the extracted text
print(full_text[0:1000])


TRAVELS
\
THROUGH
D E N M A B K ,
%
AND PART OF
S W E D E N ,
%
DURING THE WINTER AND SPRING OF THE YEAR
1809 :
i
CONTAINING'
f »
#
é •
AUTHENTIC PARTICULARS OF THE DOMESTIC CONDITION  
OF THOSE COUNTRIES, THE OPINIONS OF THE INHA-
BITANTS, AND THE STATE OF AGRICULTURE.
'  /
B Y  JAM ES MACDONALD.
f\
\
\
LONDON:
^  4
PRINTED FOR RICHARD PH ILLIPS,
BRIDGE STREET, BLACKFRIARS,  
liy B. MC MIEI.AN, BOW STREET, COVENT GARDEN.'

5 .
I.’ V - ■
II
l
V.‘  l';
\ *  ;• •(
I
* :  '* J  \;■>- •  4
1  ^
i
1
\
i
t
>
\if; ■ '■V
5 , ■  ‘ , r
r *  . ■ ; '  p O V : t h .' V ' : H
*x't;'r. V  i . }C ,"'.. ■  V r f V
’ f'At ’ -V * ■ ' .  f\*
■ r.-'i , r ' - 1.;' ;*-L  '^ V . - 't  .  ■  ' ■  '*  - c v J * '  •  j
' Z " ^ x - ,  u'.-'--V "
;' - i.• ■  i\

ADVERTISEMENT.
IN  a letter to the Publisher, the Author of the follow«  
ing pages informs -him, that his desultory Jovirnal was  
written during a tour in the Spring of 1809, through some  
Danish and Swedish Provinces; and he cannot but think,  
that 

In [3]:
reader.pages[0].extract_text()

'Digitaliseret af | Digitised by\nForfatter(e) | Author(s): Macdonald, James.\nTitel | Title: Travels through Denmark and part of Sweden\nduring the winter and spring of the year\n1809 : Containing authentic particulars of the\ndomestic condition of those countries, the\nopinions of the inhabitants, and the state of\nagriculture.\nUdgivet år og sted | Publication time and place: London : Richard Phillips, 1810\nFysiske størrelse | Physical extent: 88 s.\nDK\nMaterialet er fri af ophavsret. Du kan kopiere, ændre, distribuere eller\nfremføre værket, også til kommercielle formål, uden at bede om tilladelse.\nHusk altid at kreditere ophavsmanden.\nUK\nThe work is free of copyright. You can copy, change, distribute or present the\nwork, even for commercial purposes, without asking for permission. Always\nremember to credit the author.'

## RegEx to clean / preprocess text 

When we work with Regex, the website [regex101.com](https://regex101.com/) is a brilliant tool. It helps us both to understand Regex and to write Regex patterns. It is a good idea to take ten minutes to familiarize yourself with the page.

Try inserting this text string in the 'TEST STRING' field: 

_He observes, that he has not only committed to paper his  
own opinions, but also, those of persons with wliom he con-  
versed in the above-mcnt ioned eountries_ i

In the 'REGULAR EXPRESSION' field you can write this pattern: `\W+`.

What happens when you start typing?


In the field EXPLANATION, at the top of the right side, you can read an explanation of the regex pattern.
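The same experiment can be repeated in Python. The following is a minimal sketch, assuming the pattern is `\W+` (one or more non-word characters). The variable names `sample`, `separators` and `words_only` are illustrative only.

```python
import re

# Sample OCR text from the book (line breaks included)
sample = (
    "He observes, that he has not only committed to paper his\n"
    "own opinions, but also, those of persons with wliom he con-\n"
    "versed in the above-mcnt ioned eountries"
)

# \W matches any character that is not a letter, digit or underscore;
# + means "one or more in a row"
separators = re.findall(r"\W+", sample)

# Replacing every run of non-word characters with one space
# leaves only the words
words_only = re.sub(r"\W+", " ", sample)

print(separators[:5])
print(words_only)
```

Note that the hyphenated line break in "con-" / "versed" becomes "con versed": the pattern removes the hyphen and the line break, but it cannot know that the two halves belong to one word.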

### Write a function to clean text

We use the [Python function](https://www.w3schools.com/python/python_functions.asp) `clean` below to strip the text of all characters other than letters.


In [4]:
import re
def clean(text): 
    
    # Match a variety of punctuation marks and special characters.
    # The backslash \ and the pipe symbol | play important roles here, for example in \?
    # Now is a good time to look up what \ and | do.
    text = re.sub(r'\.|,|:|;|!|\?|\(|\)|\||\+|\'|\"|‘|’|“|”|\'|\’|…|\-|_|–|—|\$|&|\*|>|<|\/|\[|\]', ' ', text)

    # Regex pattern to match numbers and words containing numbers
    text = re.sub(r'\b\w*\d\w*\b', '', text)
     

    # lower the letters
    text = text.lower()

    # Replace sequences of non-word characters ('\W+') with a single space.
    # 'strip()' removes any leading or trailing whitespace left over from the substitution.
    text = re.sub(r'\W+', ' ', text).strip()
    
    return text
    

text = clean(full_text)

In [5]:
text[:3000]

'travels through d e n m a b k and part of s w e d e n during the winter and spring of the year i containing f é authentic particulars of the domestic condition of those countries the opinions of the inha bitants and the state of agriculture b y jam es macdonald f london printed for richard ph illips bridge street blackfriars liy b mc miei an bow street covent garden i v ii l v l i j i i t if v r r p o v t h v h x t r v i c v r f v f at v f r i r l v t c v j j z x u v i i advertisement in a letter to the publisher the author of the follow ing pages informs him that his desultory jovirnal was written during a tour in the spring of through some danish and swedish provinces and he cannot but think that its perusal may draw the attention of onr countrymen from the temporary subjects of the moment to such matters as are connected with the permanent interests of the britisk empire he observes that he has not only committed to paper his own opinions but also those of persons with wliom he con
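The cleaned text contains split words such as "inha bitants" and "follow ing", because the printed book hyphenates words across line breaks. One possible extra preprocessing step, sketched below, joins such words before cleaning. The helper name `join_hyphenated` is hypothetical and not part of the original workflow. The sketch only handles a plain hyphen at the end of a line, so it misses cases where OCR has misread the hyphen (for example as `«`), and it also merges genuine compounds that happen to break at a line end.

```python
import re

# Two lines from the title page, as extracted from the PDF
raw = (
    "OF THOSE COUNTRIES, THE OPINIONS OF THE INHA-\n"
    "BITANTS, AND THE STATE OF AGRICULTURE."
)

def join_hyphenated(text):
    # Remove a hyphen at the end of a line together with optional
    # spaces, the line break and any leading spaces on the next line
    return re.sub(r"-[ \t]*\n[ \t]*", "", text)

joined = join_hyphenated(raw)
print(joined)
```

If used, such a step would run on `full_text` before `clean` is called.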

## \w+ together with \b

What happens on Sunday, on Monday or yesterday?

Finding words with a particular ending, e.g. _day_, can help us gain insight into where and when the events in a text take place.

The regex pattern `\w+day` finds sequences of word characters that end with the letters 'day'.

`\b` matches a word boundary, i.e. the position where a word starts or ends. Placed after `day`, it ensures that the match ends where the word ends.

You can also use endings to find grammatical forms. Words with a particular suffix, such as '-ly', are relatively easy to identify. Try it.





In [6]:
ending = re.findall(r'\w+day\b', text)
print(ending)

['sunday', 'monday', 'yesterday', 'yesterday', 'terday', 'yesterday', 'yesterday', 'yesterday', 'yesterday']
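To see what `\b` contributes, it can help to run the pattern with and without it on a short string. The sketch below uses a made-up sentence (not taken from the book) and applies the same idea to the suffix '-ly'.

```python
import re

# Invented sample sentence, lowercased like the cleaned text
sample = "yesterday and on sundays he was nearly always friendly"

# Without \b the match may stop in the middle of a longer word
without_boundary = re.findall(r"\w+day", sample)

# With \b the letters 'day' must be the last letters of the word
with_boundary = re.findall(r"\w+day\b", sample)

# The same idea for words ending in 'ly'
ly_words = re.findall(r"\w+ly\b", sample)

print(without_boundary)  # ['yesterday', 'sunday']
print(with_boundary)     # ['yesterday']
print(ly_words)          # ['nearly', 'friendly']
```

Without the boundary, 'sundays' yields the partial match 'sunday'. With it, 'sundays' is skipped entirely because the word does not end in 'day'.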


## More metacharacters, as well as pipes, lists and question marks

In literature, comparisons are often used to illustrate a point more clearly by painting a picture of what is being described. Comparisons also make a text more lively and interesting.

Regex makes it a manageable task to retrieve examples of comparisons, because we can search for text strings that follow the pattern of a typical comparison.

We can illustrate it in the following way. We look for phrases whose pattern is either _as a ..._ or _as an ..._.

There are two ways to do this.

The first way is to use the pipe `|`, which means "or". The regex pattern then looks like this: `'as\sa\s\w+|as\san\s\w+'`

The second way is to use the list characters `[]` together with a question mark `?`. It looks like this: `'as\sa[n]?\s\w+'`.

Inside the list (also called a character class) we put the characters that are allowed at that position in the word. The question mark indicates that the character may or may not be present.

In [7]:
comparison = re.findall(r'as\sa\s\w+', text)
print (comparison)

['as a duty', 'as a decoy', 'as a short', 'as a sensible', 'as a considerable', 'as a spy', 'as a guard', 'as a stranger', 'as a german', 'as a troop', 'as a prison', 'as a guard', 'as a boat', 'as a midshipman', 'as a one', 'as a matter', 'as a real', 'as a coward', 'as a proof', 'as a real', 'as a prisoaeria', 'as a very', 'as a bitter', 'as a long', 'as a v', 'as a knowledge', 'as a matter', 'as a biessing', 'as a dead', 'as a military', 'as a peculiar', 'as a good', 'as a bad']


In [8]:
comparison = re.findall(r'as\sa\s\w+|as\san\s\w+', text)
print (comparison)

['as a duty', 'as a decoy', 'as an arduous', 'as a short', 'as a sensible', 'as a considerable', 'as a spy', 'as a guard', 'as a stranger', 'as a german', 'as a troop', 'as a prison', 'as a guard', 'as a boat', 'as a midshipman', 'as a one', 'as a matter', 'as an odious', 'as a real', 'as a coward', 'as a proof', 'as an arrow', 'as a real', 'as a prisoaeria', 'as a very', 'as a bitter', 'as a long', 'as a v', 'as a knowledge', 'as an abundance', 'as a matter', 'as an university', 'as a biessing', 'as a dead', 'as a military', 'as an inde', 'as a peculiar', 'as a good', 'as a bad']


In [9]:
comparison = re.findall(r'as\sa[n]?\s\w+', text)
print (comparison)

['as a duty', 'as a decoy', 'as an arduous', 'as a short', 'as a sensible', 'as a considerable', 'as a spy', 'as a guard', 'as a stranger', 'as a german', 'as a troop', 'as a prison', 'as a guard', 'as a boat', 'as a midshipman', 'as a one', 'as a matter', 'as an odious', 'as a real', 'as a coward', 'as a proof', 'as an arrow', 'as a real', 'as a prisoaeria', 'as a very', 'as a bitter', 'as a long', 'as a v', 'as a knowledge', 'as an abundance', 'as a matter', 'as an university', 'as a biessing', 'as a dead', 'as a military', 'as an inde', 'as a peculiar', 'as a good', 'as a bad']
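The sketch below compares the two patterns on a short invented sentence and adds two variations that are not part of the original notebook: `an?`, which is a shorter way of writing `a[n]?`, and a leading `\b`, which prevents matches that start inside a word such as "was".

```python
import re

# Invented sample sentence
sample = "he was a soldier treated as a spy and as an enemy"

with_pipe = re.findall(r"as\sa\s\w+|as\san\s\w+", sample)
with_list = re.findall(r"as\sa[n]?\s\w+", sample)

# A list with a single character can be written without brackets
without_brackets = re.findall(r"as\san?\s\w+", sample)

# \b requires 'as' to start at a word boundary, so 'was a ...' is skipped
whole_word = re.findall(r"\bas\san?\s\w+", sample)

print(with_pipe)   # ['as a soldier', 'as a spy', 'as an enemy']
print(whole_word)  # ['as a spy', 'as an enemy']
```

The first three patterns return the same list. Only the version with `\b` leaves out 'as a soldier', which comes from "was a soldier" and is not a comparison.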


## Curly brackets for a concordance tool

We are going to use curly brackets in our RegEx, in a concrete example: building a concordance tool. A concordance tool extracts text snippets based on a keyword and a range of surrounding characters.

We will find snippets containing the keyword _eye_. This lets us zoom in on the text and see exactly how the term is used. In horror novels, for instance, I would imagine that words like _eyes_ play a special role.

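For this task we also need the full stop (`.`), which matches any single character except a line break, together with curly brackets, which state how many times the preceding element must be repeated. Written as `.{30}`, the pattern matches exactly 30 characters of any kind, not only word characters.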

In [10]:
concordance_tool = re.findall(r'.{30}\beye[s]?\b.{30}', text)

concordance_tool

['trict around us as far as the eye could reach on a clear day th',
 ' him and the expression of my eyes and features were a complete ',
 'see them drowned before their eyes some of the stoutest and who ',
 'nd however is a relief to the eye on a broad passage and though',
 ' spread like a map under your eyes the isle of amak which is the',
 'n the island of saltholm your eye follows the swedish coast for',
 ' forfarshire dialect and with eyes sparkling with pleasure for i',
 'ack and very seldoni red long eye lashes beauti fully arched ey']
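Because `.{30}` demands exactly 30 characters, a keyword that sits closer than 30 characters to the start or end of the text is not found. A possible variation, sketched here on an invented sentence, is to give a range such as `.{0,30}`, meaning "between 0 and 30 characters".

```python
import re

# Invented sample: one keyword near the start, one near the end
sample = "his eyes were fixed on the coast as far as the eye could reach on a clear day"

# Exactly 30 characters of context on each side
strict = re.findall(r".{30}\beye[s]?\b.{30}", sample)

# Up to 30 characters of context on each side
flexible = re.findall(r".{0,30}\beye[s]?\b.{0,30}", sample)

print(strict)
for snippet in flexible:
    print(repr(snippet))
```

On this sample the strict pattern finds nothing, while the flexible pattern returns a snippet for each keyword. Since `re.findall` returns non-overlapping matches, the context of one snippet can be shortened when two keywords lie close together.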

## Square brackets [A-Z]

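Find words that begin with a capital letter. The pattern `[A-Z]\w+` matches one uppercase letter followed by one or more word characters. The list comprehension then keeps only those words whose lowercase form does not occur anywhere in the text, which removes most ordinary words that are capitalised merely because they start a sentence.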

In [11]:
upper_case_words = re.findall(r'[A-Z]\w+', full_text)
upper_case_words = [i for i in upper_case_words if i.lower() not in full_text]

print (set(upper_case_words))

{'Powcrs', 'DANES', 'Thine', 'JiLin', 'Forfarshire', 'Cullen', 'Droch', 'DENMARK', 'Liimfiord', 'Hamburgh', 'VCDONALD', 'Caltcgat', 'Goschen', 'Jews', 'Norwegian', 'Catteau', 'Norvvegians', 'Dryer', 'Engaged', 'Sæbye', 'RELIGION', 'Highlanders', 'Biisching', 'Henne', 'Jiatred', 'Sourid', 'Tuiew', 'Banes', 'BH', 'Osterbrandersler', 'Drackenberg', 'Windows', 'Banish', 'December', 'Tntland', 'Iaw', 'TUMULT', 'Jullaud', 'Wellesley', 'Salzburgh', 'Aalborgers', 'MIEI', 'Odense', 'Kuttner', 'Kongs', 'Scotch', 'Rmean', 'ItTS', 'Southern', 'Tostand', 'Faroe', 'Amak', 'Ordnance', 'Ljttle', 'Ruris', 'Hobroe', 'Blote', 'Copenhngen', 'Greek', 'Eisinore', 'Amongothers', 'Christmas', 'Butter', 'Mauy', 'Briton', 'Scotcli', 'Gerinan', 'Swedcs', 'Dauish', 'Botany', 'England', 'Frpnch', 'ELSINOItE', 'Sciences', 'Lessoe', 'Chinese', 'Englisli', 'ACDOXALD', 'Jut', 'Tycho', 'Iltese', 'Utility', 'Gertnany', 'Duntzvelt', 'April', 'Ilolstein', 'Aenf', 'CHARACTERISTICS', 'Langeland', 'Sundays', 'Norvvay', 'Thes
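The test `i.lower() not in full_text` checks for a substring, so a short name is discarded whenever its letters occur inside any longer word. One possible variation, sketched below on an invented sentence, compares against a set of whole lowercase words instead. The names `lowercase_words`, `by_substring` and `by_whole_word` are illustrative assumptions.

```python
import re

# Invented sample sentence
sample = ("He sailed from Man to England with many letters "
          "for the king and he was glad.")

# Words that start with a capital letter
capitalised = re.findall(r"[A-Z]\w+", sample)

# Substring test: 'man' is found inside 'many', so 'Man' is dropped
by_substring = {w for w in capitalised if w.lower() not in sample}

# Whole-word test: collect every word that starts with a lowercase letter
lowercase_words = set(re.findall(r"\b[a-z]\w*", sample))
by_whole_word = {w for w in capitalised if w.lower() not in lowercase_words}

print(by_substring)
print(by_whole_word)
```

Here the substring test keeps only 'England', whereas the whole-word test keeps both 'Man' and 'England'. Looking words up in a set is also faster than searching the whole string for each word.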

## Fuzzy searches in OCR text

Regex can also be used to perform a fuzzy search on OCR-processed text. Let's try to locate instances of the word "danish" or its misread form "danisli" within the text, allowing for the slight variations or errors that can occur during OCR processing.

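`.{30}`: matches any 30 characters before and after the target word, providing context around the match.

`danis[h|li]?`: looks for "danis", optionally followed by one more character, where:

`danis` is the fixed part of the word.

`[h|li]?` is a list (character class) followed by a question mark. It matches at most one of the characters `h`, `|`, `l` or `i`. Inside square brackets the pipe is an ordinary character, not "or", and the class never matches the two-letter sequence "li". The pattern still finds "danisli", because `l` is matched by the class and the final `i` falls into the 30 characters of context that follow.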
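Square brackets always match a single character, so `[h|li]` stands for one of `h`, `|`, `l` or `i` rather than for "h or li". To match either "h" or the sequence "li", a group with a pipe can be used. The sketch below compares the two on an invented sentence. The misspelling "danisi" is made up for illustration.

```python
import re

# Invented sample with one correct and two misread spellings
sample = "the danish monarch sat in a danisli prison near the danisi coast"

# Character class: at most ONE of the characters h, |, l, i after 'danis'
with_class = re.findall(r"danis[h|li]?", sample)

# Non-capturing group: either 'h' or the two-letter sequence 'li'
with_group = re.findall(r"danis(?:h|li)", sample)

print(with_class)  # ['danish', 'danisl', 'danisi']
print(with_group)  # ['danish', 'danisli']
```

The character class is the more permissive of the two, which can be an advantage in a fuzzy search. The group is stricter and returns the complete word "danisli".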


In [12]:
fuzzy_search = re.findall(r'.{30}danis[h|li]?.{30}', text)

fuzzy_search

['in the spring of through some danish and swedish provinces and he ',
 'countries and procure for the danish prisoners in britain as kind ',
 ' opportu nity of thanking the danish monarch and nation for their ',
 'mber f o n f in e d here in a danisli prison i have abundance of t',
 'be plundered the moment their danish enemies could ap proach the s',
 ' a lifeless con dition by the danish boatmen but was soon restored',
 'leaned and at the rate of one danish mile per hour they devise man',
 'ore where in the space of one danisli or four english milesand thr',
 'of the sailors and one of the danish soldiers were much bruised by',
 'e and lieutenant henne of the danish navy treated me with all poss',
 't any ceremony six masters of danish pri vateers half drunk togeth',
 'en and brought in bertfby thc danish pri vateers there are twenty ',
 'e language commonly spoken is danish but the peoplc ot rank and ed',
 'jackets of their own and some danish jackets or great coats lent t',
 'asts